exploratory analysis & basic preprocessing

Exploratory Visualization

plotting average yearly sales
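As a rough sketch of this step (the tiny `train` frame below is made up for illustration; the real notebook would load the competition's `train.csv`, which has `date` and `sales` columns):

```python
import pandas as pd

# Hypothetical stand-in for the competition's training data
train = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-01", "2015-07-01",
                            "2016-06-01", "2016-07-01"]),
    "sales": [100.0, 120.0, 140.0, 160.0],
})

# Average sales per calendar year, ready to plot with .plot(kind="bar")
avg_yearly = train.groupby(train["date"].dt.year)["sales"].mean()
```

Grouping on `dt.year` keeps the whole computation vectorized, so it scales to the full dataset without any per-row Python loops.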

plotting average yearly sales for each category

plotting monthly and yearly seasonality
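A seasonal plot boils down to averaging sales over each calendar period. A minimal sketch, using a synthetic daily series with a weekend bump as a stand-in for the real data:

```python
import pandas as pd

# Synthetic daily series: +10 sales on weekends, standing in for real data
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
sales = 100 + 10 * (dates.dayofweek >= 5)
df = pd.DataFrame({"date": dates, "sales": sales.astype(float)})

# Weekly seasonality: mean sales per day of week (0 = Monday)
weekly = df.groupby(df["date"].dt.dayofweek)["sales"].mean()

# Monthly seasonality: mean sales per calendar month
monthly = df.groupby(df["date"].dt.month)["sales"].mean()
```

Plotting `weekly` and `monthly` (e.g. with `.plot()`) makes the respective cycles visible at a glance.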

plotting periodogram
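One way to produce a periodogram is `scipy.signal.periodogram`; a sketch on a synthetic series with a known weekly cycle (the `fs` choice and the series itself are illustrative assumptions, not the notebook's exact code):

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic two-year daily series with a weekly cycle
n = 730
t = np.arange(n)
y = 100 + 5 * np.sin(2 * np.pi * t / 7)

# fs=365.25 expresses frequencies in cycles per year, so the weekly
# component appears near 52 on the frequency axis
freqs, spectrum = periodogram(y, fs=365.25, detrend="linear")
peak_freq = freqs[np.argmax(spectrum)]
```

Plotting `spectrum` against `freqs` then shows a spike for each dominant seasonal frequency.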

feature engineering

comparing a few ways to merge the dataframes:

using apply method

iterating over the dataframe rows

using list comprehension

using join method

comparison result:

it's clear that the join method is the fastest by a large margin, and, unluckily, it was the last one to occur to me. Still, it was a useful experiment: it showed me first-hand that iterating over a dataframe's rows should be the last resort, taken only when no other option is available. Iterating was the slowest, with an average of about 9 minutes, roughly 30% slower than the apply method, and the bigger the task, the bigger that gap would grow. The other two methods were both simpler and faster than iterating: the apply method averaged around 7 minutes, while the list comprehension averaged around 243 ms, making it the second fastest way to merge.

but in the end I would use the join method, both for its speed and because it makes joining on multiple columns easier.
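The two ends of the comparison can be sketched as follows (the `sales`/`oil` frames are made-up miniatures; the real notebook merges the competition's much larger tables):

```python
import pandas as pd

# Hypothetical frames: daily sales plus a per-date column to merge in
sales = pd.DataFrame({"date": ["d1", "d2", "d3"], "sales": [10, 20, 30]})
oil = pd.DataFrame({"date": ["d1", "d2", "d3"], "oil": [50.0, 51.0, 52.0]})

# Fast path: index-aligned join, fully vectorized
joined = sales.join(oil.set_index("date"), on="date")

# Slow path: a per-row Python lookup via apply, shown only for comparison
lookup = oil.set_index("date")["oil"]
via_apply = sales.copy()
via_apply["oil"] = sales["date"].apply(lambda d: lookup[d])
```

Both produce the same frame; the difference is that `join` stays inside pandas' C internals, while `apply` calls back into Python once per row, which is where the minutes-vs-milliseconds gap comes from.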

preparing training features

preparing testing features

training and validating

Benchmark model: LinearRegression

Lasso

Ridge

ElasticNet

SVR

KNeighborsRegressor

DecisionTreeRegressor

RandomForestRegressor

ExtraTreesRegressor

XGBRegressor

training and validation conclusion:

it's clear that the best three models are:

3) ExtraTreesRegressor with a validation RMSLE of 0.54 and a Training RMSLE of 0.48.

2) KNeighborsRegressor with a validation RMSLE of 0.52 and a Training RMSLE of 0.46.

1) XGBRegressor with a validation RMSLE of 0.51 and a Training RMSLE of 0.28.

finally, it's clear that the tree-based models have outperformed the linear models.
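For reference, the RMSLE scores above can be computed with a small helper like this (a sketch of the standard formula, not necessarily the exact function used in the notebook):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, the competition metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p handles zero sales gracefully, since log1p(0) == 0
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Because errors are taken on a log scale, RMSLE penalizes relative errors rather than absolute ones, which suits sales data spanning several orders of magnitude.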

making sure that the prediction has correctly captured the seasonality

Although the model has learned the seasonality of the training data quite well, we can see that for the validation data, which is two weeks ahead of the training data, there is still weekly and biweekly seasonality that the model wasn't able to completely forecast.

testing the best 3 models

if you want to check out the results, you can submit the resulting file by joining the competition:

https://www.kaggle.com/c/store-sales-time-series-forecasting/overview

Benchmark model: LinearRegression

using the model to make predictions on the test set, and submitting the result to the Kaggle competition

result: 0.49920 (RMSLE)

ExtraTreesRegressor

using the model to make predictions on the test set, and submitting the result to the Kaggle competition

result: 0.47770 (RMSLE)

KNeighborsRegressor

using the model to make predictions on the test set, and submitting the result to the Kaggle competition

result: 0.45739 (RMSLE)

XGBRegressor

using the model to make predictions on the test set, and submitting the result to the Kaggle competition

result: 0.45554 (RMSLE)

testing conclusion:

the testing results confirm the same ranking as the training and validation results.